ABSTRACT


This paper provides a guide for using the Kolmogorov-Smirnov
Goodness of Fit Test when testing for normality,
especially in cases where the mean and variance must be
estimated from the sample.


INTRODUCTION

An effective method for testing the hypothesis that a
set of data comes from a specified distribution is the
Kolmogorov-Smirnov Goodness of Fit Test. It has been shown
(refs. 1 and 2) that if $x_i$, $i = 1, 2, \ldots, n$, represents a
sample of n independent observations hypothesized to be
from a population with cumulative distribution function
F(x), one may test this hypothesis by computing the
"maximum deviation" statistic D given by

     $D = \max_x |F_n(x) - F(x)|$

where

     $F_n(x) = \frac{\text{no. of observations less than or equal to } x}{n}$


Under the assumed hypothesis, the distribution of D is
independent of the function F, and its percentage points
may be found in standard textbooks and tables, but with the
restriction (not usually mentioned) that the assumed
distribution function F must be completely specified. If
F contains parameters which have to be estimated from the
sample, the distribution of D is different from that given
in standard tables, and in fact depends on F.
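The definition of D can be made concrete with a short computation. The following is a minimal sketch in Python, not the KOLSMR program itself; the function names `std_normal_cdf` and `ks_statistic` are hypothetical, and the CDF passed in is assumed to be completely specified and continuous.

```python
import math


def std_normal_cdf(x):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def ks_statistic(sample, cdf):
    """Return D = max_x |F_n(x) - F(x)| for a completely specified CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for k, x in enumerate(xs, start=1):
        f = cdf(x)
        # F_n jumps from (k-1)/n to k/n at the k-th ordered observation,
        # so the largest deviation occurs at one of these two limits.
        d = max(d, abs((k - 1) / n - f), abs(k / n - f))
    return d


# Two observations tested against the uniform CDF on (0, 1).
print(ks_statistic([0.25, 0.75], lambda x: x))
```

Because the empirical function is a step function, only the two one-sided limits at each ordered observation need to be examined.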


Lilliefors (ref. 3) discusses the case where F is the
normal distribution function with the mean and variance
estimated from the sample, and gives percentage points of
D as estimated by Monte Carlo runs of size 1000.


The purpose of this paper is to: 1) publicize the
results of reference 1 with improved accuracy, 2) give
percentage points of D for the normal case where the mean is
known and the variance is unknown, and 3) point out the
existence of a computer program KOLSMR (Kolmogorov-Smirnov)
which may be used to perform the K-S test at MSC.

SYMBOL TABLE


$x_i$ (i = 1, 2, ..., n)    raw data

$x_{(i)}$, $x_{(k)}$        ith, kth ordered data

n                           sample size

F                           theoretical distribution function

D                           maximum deviation

$F_n(x)$                    estimate of F based on sample of size n

$\alpha$                    probability

$C_{n\alpha}$               $1-\alpha$ percentage point of the null
                            distribution of D for a sample of size n

$F_n(x-0)$                  limit of $F_n(t)$ as $t \to x$ through
                            values less than x

$F_n(x+0)$                  limit of $F_n(t)$ as $t \to x$ through
                            values greater than x

$\mu$                       known mean

$y_i$, $z_i$                standardized variates

$\bar{x}$                   sample mean

$s$, $s^*$                  sample standard deviations

PHI                         standard normal distribution function

CDF                         (cumulative) distribution function
MEAN AND VARIANCE UNKNOWN


The following table gives estimates and 99 percent
confidence limits for percentage points of D under the null
hypothesis of normality when the mean and variance must be
estimated from the data. The estimates of the critical
values, $\hat{C}_{n\alpha}$, for sample sizes n = 3, 4, 5, 10, 15, 20, 25,
and 30 and significance levels $\alpha$ = .10, .05, and .01 are
obtained by ordering 10 000 random values of D, say
$D_{(1)}, D_{(2)}, \ldots, D_{(10\,000)}$, and choosing the
$10\,000(1-\alpha)$th of the $D_{(i)}$; thus, if $\alpha$ = .05, the
estimate of $C_{n\alpha}$ is $D_{(9500)}$. These results, based on a
larger number of runs than those in reference 1, have fairly
good accuracy. Confidence limits for $C_{n\alpha}$ are shown
alongside each estimate.
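The ordering procedure described above can be sketched as follows. This is an illustration only, not the program used to produce Table I: the function names are hypothetical, Python's built-in generator stands in for whatever random number source was used originally, and the sample standard deviation is assumed to use the divisor n - 1.

```python
import math
import random


def std_normal_cdf(x):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def d_with_estimated_parameters(sample):
    """D for a normality test with mean and variance taken from the sample."""
    n = len(sample)
    mean = sum(sample) / n
    # Assumption: divisor n - 1 in the sample standard deviation.
    s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    zs = sorted((x - mean) / s for x in sample)
    d = 0.0
    for k, z in enumerate(zs, start=1):
        f = std_normal_cdf(z)
        d = max(d, abs((k - 1) / n - f), abs(k / n - f))
    return d


def estimate_critical_value(n, alpha, trials=10000, seed=0):
    """Order `trials` simulated values of D and pick the trials*(1-alpha)th."""
    rng = random.Random(seed)
    ds = sorted(
        d_with_estimated_parameters([rng.gauss(0.0, 1.0) for _ in range(n)])
        for _ in range(trials)
    )
    index = min(max(round(trials * (1.0 - alpha)), 1), trials)  # 1-based
    return ds[index - 1]


# With 10 000 trials and alpha = .05 this selects the 9500th ordered value.
print(estimate_critical_value(15, 0.05, trials=2000, seed=1))
```

A run of this kind for n = 15 and $\alpha$ = .05 should land near the Table I entry of .2205, subject to sampling error and to the divisor assumption noted above.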


TABLE I. - CRITICAL VALUES OF D WHEN TESTING FOR
NORMALITY WITH MEAN AND VARIANCE UNKNOWN
(10 000 TRIALS)


                $\alpha$ = .10            $\alpha$ = .05            $\alpha$ = .01
             99%            99%       99%            99%       99%            99%
   n         Low    Est.    High      Low    Est.    High      Low    Est.    High

   3        .3659  .3672   .3689     .3748  .3760   .3770     .3826  .3831   .3835
   4        .3402  .3447   .3497     .3702  .3741   .3774     .4064  .4098   .4148
   5        .3148  .3183   .3213     .3394  .3435   .3474     .3873  .3928   .4015
  10        .2399  .2424   .2454     .2617  .2642   .2684     .2988  .3035   .3096
  15        .2005  .2022   .2051     .2177  .2205   .2245     .2531  .2589   .2646
  20        .1740  .1756   .1773     .1891  .1915   .1942     .2188  .2236   .2274
  25        .1579  .1594   .1612     .1712  .1730   .1753     .1973  .2014   .2058
  30        .1450  .1468   .1483     .1576  .1590   .1612     .1837  .1885   .1920

 over 30        $.805/\sqrt{n}$            $.86/\sqrt{n}$            $1.031/\sqrt{n}$


Est. = $\hat{C}_{n\alpha}$ = estimate of $C_{n\alpha}$ based on 10 000 Monte Carlo runs,
where $C_{n\alpha}$ is such that $\Pr\{D > C_{n\alpha}\} = \alpha$ for a
sample of size n.
MEAN KNOWN AND VARIANCE UNKNOWN


When the mean is assumed known and the variance is
unknown, simulations show that, except for very small sample
sizes (n < 3), the distribution of D is essentially the
same as for the standard case where the mean and variance
are known. The standard table follows so that one might
see the difference between this case and the unknown mean
case (Table I).


Note that the critical values in Table II are considerably
higher than in Table I. Consequently, if one
erroneously uses Table II when the mean and variance are
estimated, there is only a small chance of rejecting the
null hypothesis even though it is false. For example, if
the mean and variance are estimated from the sample, the
probability of D exceeding .2205 for a sample of size 15
is about .05 (Table I). Thus one would reject a null
hypothesis of normality at the 5 percent level if the
observed value of D were greater than .2205. If
Table II were used instead, one would not reject the null
hypothesis unless the observed D were greater than .338, an
event with probability much less than .05.
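The effect of choosing the wrong table can be shown with a small comparison. The sketch below copies the n = 15 rows of Table I and Table II; the observed value of D is hypothetical and the function name `rejects` is introduced only for illustration.

```python
# Critical values for n = 15, copied from Table I (mean and variance
# estimated) and Table II (standard table).
TABLE_I_N15 = {0.10: 0.2022, 0.05: 0.2205, 0.01: 0.2589}
TABLE_II_N15 = {0.10: 0.304, 0.05: 0.338, 0.01: 0.404}


def rejects(d_observed, critical_value):
    """Reject the null hypothesis when D exceeds the critical value."""
    return d_observed > critical_value


d_observed = 0.25  # hypothetical value of D from a sample of size 15

print(rejects(d_observed, TABLE_I_N15[0.05]))   # True: exceeds .2205
print(rejects(d_observed, TABLE_II_N15[0.05]))  # False: below .338
```

An observed D of .25 is significant at the 5 percent level under Table I but would pass unnoticed if Table II were consulted.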


For n = 3, the exact values of $C_{3\alpha}$ for
$\alpha$ = .10, .05, and .01 are .659, .726, and .798 when
the variance is unknown. For n = 4, the 99 percent
confidence limits are (.556, .569), (.611, .632), and
(.718, .741). Note that the standard values shown in
Table II all lie in these confidence intervals. The
same is true for larger values of n.


TABLE II. - CRITICAL VALUES OF D WHEN TESTING
FOR NORMALITY WITH MEAN KNOWN AND
VARIANCE KNOWN (STANDARD TABLE)


                $\alpha$ = .10      $\alpha$ = .05      $\alpha$ = .01
   n            $C_{n\alpha}$        $C_{n\alpha}$        $C_{n\alpha}$

   3              .642          .708          .829
   4              .564          .624          .734
   5              .510          .563          .669
  10              .368          .409          .486
  15              .304          .338          .404
  20              .264          .294          .352
  25              .240          .264          .317
  30              .220          .242          .290
  35              .210          .230          .270

 over 35      $1.22/\sqrt{n}$    $1.36/\sqrt{n}$    $1.63/\sqrt{n}$



THE COMPUTER PROGRAM KOLSMR 


The Theory & Analysis Office in the Computation &
Analysis Division at MSC has a computer program KOLSMR
(ref. 4), which performs the Kolmogorov-Smirnov test on
a set of data. The program prints the maximum deviation
D and its critical values for either the case when the
distribution F is completely specified, or when F is
normal with unknown mean and variance.


Since the critical values are not distribution free when
estimating parameters, they may be invalid if F is not
normal. However, if no parameters are to be estimated, then
the given values hold for any F.


Calling Sequence

The program is called by the following sequence:

     CALL KOLSMR (X,N,F,KW,KN,D)

where:  X  is a singly dimensioned array of observations

        N  is the sample size

        F  is the theoretical distribution function

        D  is the maximum deviation, i.e.,

               $D = \max_x |F_n(x) - F(x)|$


KW is an indicator giving the following modes of output:

     KW < 0             The maximum deviation D and a table of
                        critical values are printed.

     KW = 0             Nothing is printed.

     KW = k             Every kth ordered observation $x_{(k)}$ is
     (k = 1, 2, ...)    printed with the left and right limits of
                        the estimated CDF, $F_n(x-0)$ and $F_n(x+0)$,
                        the true CDF, F(x), and the maximum of the
                        two differences

                             $|F(x_{(k)}) - F_n(x_{(k)} - 0)|$

                        and

                             $|F(x_{(k)}) - F_n(x_{(k)} + 0)|$

                        with the appropriate sign of that difference.
                        Finally, the maximum deviation D is printed
                        with a table of critical values.


KN is an indicator which functions as follows:

1. When the theoretical distribution function F
   (normal or otherwise) is completely specified,
   KN should be set equal to 0.

2. If one desires to test for normality with a known
   mean $\mu$, but unknown variance, form the new
   observations $y_i = x_i - \mu$, set KN = 1, and
   let F be the standard normal CDF.

   The program will compute standardized variates

        $z_i = y_i / s^*$

   where $s^*$ is the sample standard deviation of the $y_i$
   taken about the known mean, and test them against F.

3. If one desires to test for normality with both mean
   and variance unknown, set KN = 2 and test against
   the standard normal CDF.

   The program will compute the standardized variates

        $z_i = (x_i - \bar{x}) / s$

   where $\bar{x}$ is the sample mean and $s$ is the sample
   standard deviation. The $z_i$ will then be tested against
   the standard normal CDF.
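The three KN modes amount to three ways of preparing the data before the comparison with F. The following Python sketch illustrates that preparation under stated assumptions; it is not the KOLSMR source. The function name `standardize` is hypothetical, and the divisors used for the two standard deviations (n when the mean is known, n - 1 when it is estimated) are assumptions, since the exact formulas are not recoverable here.

```python
import math


def standardize(values, kn):
    """Prepare observations for comparison with F according to KN.

    kn = 0: F is completely specified; values are used unchanged.
    kn = 1: values are y_i = x_i - mu (known mean already removed);
            they are divided by s*.
    kn = 2: mean and variance both unknown; values are centered on the
            sample mean and divided by s.
    """
    n = len(values)
    if kn == 0:
        return list(values)
    if kn == 1:
        # Assumption: s* uses divisor n because the mean is known.
        s_star = math.sqrt(sum(y * y for y in values) / n)
        return [y / s_star for y in values]
    if kn == 2:
        mean = sum(values) / n
        # Assumption: s uses divisor n - 1.
        s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
        return [(x - mean) / s for x in values]
    raise ValueError("KN must be 0, 1, or 2")


print(standardize([1.0, 2.0, 3.0], 2))
print(standardize([-1.0, 1.0], 1))
```

In modes 1 and 2 the resulting variates are compared with the standard normal CDF, which is why F must be PHI in those cases.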
EXAMPLE


Figure 1 is an example of the output from KOLSMR. In
this example, 40 random numbers, uniformly distributed
between 0 and 10, were tested for normality without
specifying the mean and variance. The numbers were stored in
an array X, and KOLSMR was called as follows:

     CALL KOLSMR (X,40,PHI,1,2,D)

The function PHI is the standard normal CDF in the
UNIVAC 1108 library and was declared EXTERNAL. Since it
was desired to print every value of X, KW was set equal
to 1, and since this was a test for normality with unknown
mean and variance, KN was set equal to 2.


The columns K and X(K) are self-explanatory. Note
that the X(K) appear in increasing order. The columns
labeled FH(x-0) and FH(x+0) are the left- and right-hand
limits as x -> X(K) of $F_n(x)$. This is equivalent to

     FH(x-0) = (k-1)/n ,     FH(x+0) = k/n .

For example, when k = 17, FH(x-0) = 16/40, FH(x+0) = 17/40,

     F(x) = PHI( ($x_{(17)} - \bar{x}$)/s )
          = PHI( ($x_{(17)}$ - 6.014982)/2.594344 ) = .425539 ,

     FH(x+0) - F(x) = -.000539 ,     FH(x-0) - F(x) = -.025539 .

DIFF is the above difference which is largest in magnitude;
thus, DIFF = -.025539.
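The arithmetic for k = 17 can be checked directly. The sketch below is illustrative only; the function names are hypothetical, and the value F(x) = .425539 is taken from the example above rather than recomputed.

```python
def one_sided_limits(k, n):
    """Left and right limits of F_n at the k-th ordered observation."""
    return (k - 1) / n, k / n


def signed_diff(k, n, f):
    """Return whichever of FH(x-0) - F(x), FH(x+0) - F(x) is larger in magnitude."""
    left, right = one_sided_limits(k, n)
    d_left, d_right = left - f, right - f
    return d_left if abs(d_left) >= abs(d_right) else d_right


left, right = one_sided_limits(17, 40)
print(left, right)                                # 0.4 0.425
print(round(signed_diff(17, 40, 0.425539), 6))    # -0.025539
```

The maximum deviation D is then the largest absolute value of this quantity over all 40 ordered observations.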
The maximum deviation is the largest absolute difference;
thus, D = |-.138899| = .138899 (the maximum deviation occurs at
observation number 23).

If the original data were normally distributed (the null 
hypothesis), the probability of D being larger than .138 is 
only .05. Since the observed value of D was .139, we 
therefore reject the null hypothesis of normality at the 
5 percent level. 

The sample mean and standard deviation ($\bar{x}$ and s) are
printed at the top of the output if KN = 2. If KN = 1,
the mean is always printed as 0. If KN = 0, the mean is
printed as 0, and the standard deviation is printed as 1.
Figure 1. - EXAMPLE OF OUTPUT FROM KOLSMR
RESTRICTIONS


1. If one is testing data for a distribution other than
normal, the theoretical CDF, F, must be completely
specified.

2. The function F must be declared EXTERNAL in the
calling program.

OTHER INFORMATION 

The standard normal CDF PHI(X) is available to UNIVAC
1108 users on the system library at MSC. The Theory and
Analysis Office also has decks of other CDF's, e.g., gamma,
beta, "t", which may be used on the 1108.